Llane/sdg ray docs #1347
base: main
Signed-off-by: Lawrence Lane <[email protected]>
Pull request overview
This PR adds comprehensive documentation for the new Ray-based Synthetic Data Generation (SDG) capabilities in NeMo Curator. The documentation covers both simple multilingual Q&A generation and advanced NemotronCC pipelines for text transformation and knowledge extraction.
Key Changes
- Added tutorial README with quick start examples and command-line reference for all SDG scripts
- Created comprehensive documentation structure covering LLM client configuration, multilingual Q&A tutorials, and NemotronCC pipeline workflows
- Updated release notes to reflect SDG feature availability and removed the previous limitation note about SDG being under refactoring
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tutorials/synthetic/README.md | Enhanced tutorial README with detailed usage examples, command-line arguments table, and links to documentation |
| docs/index.md | Added Synthetic Data section to the main documentation table of contents |
| docs/curate-text/synthetic/index.md | Created overview page explaining SDG architecture, use cases, and available stages with mermaid diagram |
| docs/curate-text/synthetic/llm-client.md | Added comprehensive LLM client configuration guide covering NVIDIA API, vLLM, TGI endpoints with performance tuning |
| docs/curate-text/synthetic/multilingual-qa.md | Created step-by-step tutorial for generating multilingual Q&A pairs with code examples and CLI reference |
| docs/curate-text/synthetic/nemotron-cc/index.md | Documented NemotronCC pipeline architecture with composable pattern explanation and task configuration |
| docs/curate-text/synthetic/nemotron-cc/tasks.md | Created detailed reference for all five NemotronCC tasks with prompt templates and post-processing logic |
| docs/curate-text/index.md | Added Synthetic Data Generation card to the text curation index page |
| docs/about/release-notes/index.md | Added SDG feature announcement and removed previous limitation note |
| - NVIDIA API | ||
| - Base URL for the API endpoint | ||
| * - `--model-name` | ||
| - llama-3.3-70b |
Copilot
AI
Jan 2, 2026
The default value "llama-3.3-70b" doesn't match the actual default used in the example script (synthetic_data_generation_example.py), which is "meta/llama-3.3-70b-instruct". Update this to match the actual implementation for consistency.
| - llama-3.3-70b | |
| - meta/llama-3.3-70b-instruct |
| ## Command-Line Arguments | ||
|
|
Copilot
AI
Jan 2, 2026
The section header "Command-Line Arguments" discusses arguments across different scripts, but the title suggests these are universal. Consider adding clarifying text that differentiates between common arguments (used by multiple scripts) and script-specific arguments, or rename to "Command-Line Reference" for better clarity.
| ## Command-Line Arguments | |
| ## Command-Line Reference | |
| The arguments below are grouped into options shared across multiple example scripts and options specific to particular NemotronCC pipelines. Not every argument applies to every tutorial; refer to each script's `--help` output for the complete, authoritative list. |
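One way to keep shared and script-specific options separate is a parent parser that each script extends. The sketch below is only an illustration with `argparse`: `--model-name` and its default come from the review comment above, while `--base-url` (the table gives only the description "Base URL for the API endpoint") and `--task` are hypothetical names, not documented flags.

```python
import argparse


def build_common_parser() -> argparse.ArgumentParser:
    """Options assumed to be shared by several example scripts."""
    common = argparse.ArgumentParser(add_help=False)
    # `--base-url` is a hypothetical flag name; only its description is known.
    common.add_argument("--base-url", help="Base URL for the API endpoint")
    common.add_argument(
        "--model-name",
        default="meta/llama-3.3-70b-instruct",
        help="Model served by the endpoint",
    )
    return common


def build_script_parser() -> argparse.ArgumentParser:
    """Parser for one hypothetical script: shared options plus its own."""
    parser = argparse.ArgumentParser(
        description="Hypothetical SDG example script",
        parents=[build_common_parser()],
    )
    # `--task` is an illustrative script-specific option, not a documented flag.
    parser.add_argument("--task", default="diverse_qa")
    return parser


args = build_script_parser().parse_args(["--task", "distill"])
print(args.model_name, args.task)
```

With this layout, each script's `--help` output lists both groups, which matches the advice to treat `--help` as the authoritative list.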
Greptile Summary
This PR adds comprehensive documentation for synthetic data generation (SDG) capabilities in NeMo Curator. The documentation includes a well-structured overview, LLM client configuration guide, multilingual Q&A tutorial, and detailed NemotronCC pipeline documentation with clear architecture diagrams and task references. Major additions:
Critical issue:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Pipeline
participant LLMClient
participant NVAPI as NVIDIA API/vLLM
participant Output
User->>Pipeline: Create SDG Pipeline
User->>Pipeline: Add QAMultilingualSyntheticStage or NemotronCC Stage
User->>Pipeline: Configure AsyncOpenAIClient
User->>Pipeline: pipeline.run()
Pipeline->>LLMClient: Initialize client with rate limiting
alt Multilingual Q&A
Pipeline->>LLMClient: Generate Q&A pairs in languages
LLMClient->>NVAPI: Async API calls (max_concurrent_requests)
NVAPI-->>LLMClient: Generated Q&A responses
LLMClient->>Pipeline: Return DocumentBatch
Pipeline->>Pipeline: Apply language filters (optional)
else NemotronCC Pipeline
Pipeline->>Pipeline: Preprocessing (tokenize, segment, filter)
Pipeline->>LLMClient: Transform documents via LLM
LLMClient->>NVAPI: Batch API calls with retry logic
NVAPI-->>LLMClient: Transformed text (paraphrased/QA/distilled)
LLMClient->>Pipeline: Return transformed data
Pipeline->>Pipeline: Postprocessing (cleanup, quality filter)
end
Pipeline->>Output: Write to JSONL/Parquet
Output-->>User: Generated synthetic data
|
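The "Async API calls (max_concurrent_requests)" and "retry logic" steps in the diagram can be sketched with plain `asyncio`. This is not the NeMo Curator client: `call_with_retry`, `generate_all`, and `fake_send` are hypothetical names, and the stand-in endpoint fails once per prompt to exercise the backoff path.

```python
import asyncio
import random


async def call_with_retry(make_request, max_retries=3, base_delay=0.01):
    """Retry an async callable, doubling the delay after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return await make_request()
        except ConnectionError:
            if attempt == max_retries:
                raise
            # Exponential backoff with a little random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


async def generate_all(prompts, send, max_concurrent_requests=2):
    """Run `send(prompt)` for every prompt, at most N requests at a time."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def one(prompt):
        async with semaphore:
            return await call_with_retry(lambda: send(prompt))

    # gather() returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


# Stand-in for a real endpoint: fails once per prompt, then succeeds.
_seen = set()


async def fake_send(prompt):
    if prompt not in _seen:
        _seen.add(prompt)
        raise ConnectionError("transient failure")
    return f"answer to {prompt}"


results = asyncio.run(generate_all(["q1", "q2", "q3"], fake_send))
print(results)
```

The semaphore caps in-flight requests while the retry wrapper handles transient errors per request, so the two concerns stay independent.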
Additional Comments (1)
- docs/about/release-notes/index.md, line 21: style: The "What's Next" section placeholder needs completion before release.
11 files reviewed, 1 comment
huvunvidia
left a comment
I did a very quick pass, mainly to identify features and content that were generated by an AI agent without being manually tested.
1 file reviewed, 1 comment
| 2. **NeMo Curator with text extras** | ||
|
|
||
| ```bash | ||
| uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12] |
The installation command is inconsistent with the standard format used elsewhere in the documentation. The --extra-index-url https://pypi.nvidia.com flag is not needed with uv pip install, and the package name should be quoted.
The command should match the format used in other quickstart guides (e.g., docs/get-started/text.md line 52):
| uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12] | |
| uv pip install "nemo-curator[text_cuda12]" |
This ensures consistency with the rest of the documentation and follows the recommended installation pattern.
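A possible reason for quoting the package name, beyond consistency, is that some shells (zsh, for example) treat unquoted square brackets as a glob pattern. The sketch below uses Python's `fnmatch` as a stand-in for shell globbing, which is an approximation, and `shlex.quote` to show a safely quoted form.

```python
import shlex
from fnmatch import fnmatchcase

requirement = "nemo-curator[text_cuda12]"

# Read as a glob, the brackets form a character class, so the pattern
# matches "nemo-curator" plus ONE character from the set, not the literal text.
matches_itself = fnmatchcase(requirement, requirement)
matches_other = fnmatchcase("nemo-curatort", requirement)

# shlex.quote wraps the string so a POSIX shell passes it through unchanged.
quoted = shlex.quote(requirement)
print(matches_itself, matches_other, quoted)
```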
1 file reviewed, 1 comment
tutorials/synthetic/README.md
Outdated
|
|
||
| - [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) | ||
| - [NemotronCC Pipeline Documentation](../../docs/curate-text/synthetic/nemotron-cc/index.md) | ||
| - [Task Reference](../../docs/curate-text/synthetic/nemotron-cc/tasks.md) |
Missing newline at end of file. Add a trailing newline for POSIX compliance.
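A check like this could be automated. The following is a minimal sketch with hypothetical helper names, operating on raw bytes so it does not depend on the file's encoding.

```python
def ends_with_newline(data: bytes) -> bool:
    """Return True if the content is empty or its last byte is a newline."""
    return not data or data.endswith(b"\n")


def ensure_trailing_newline(data: bytes) -> bytes:
    """Append a single newline only when one is missing."""
    return data if ends_with_newline(data) else data + b"\n"


sample = b"- [Task Reference](../../docs/curate-text/synthetic/nemotron-cc/tasks.md)"
fixed = ensure_trailing_newline(sample)
print(ends_with_newline(sample), ends_with_newline(fixed))
```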
2 files reviewed, 2 comments
| # NeMo Curator Release Notes: {{ current_release }} | ||
|
|
||
| This major release represents a fundamental architecture shift from [Dask](https://www.dask.org/) to [Ray](https://www.ray.io/), expanding NeMo Curator to support multimodal data curation with new [video](../../curate-video/index.md) and [audio](../../curate-audio/index.md) capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads. | ||
| ## Synthetic Data Generation | ||
|
|
||
| **Migrating from a previous version of NeMo Curator?** Refer to the {ref}`Migration Guide <migration-guide>` for step-by-step instructions and the {ref}`Migration FAQ <migration-faq>` for common questions. | ||
| New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs: | ||
|
|
||
| ## Installation Updates | ||
| - **LLM Client Infrastructure**: OpenAI-compatible async/sync clients with automatic rate limiting, retry logic, and exponential backoff | ||
| - **Multilingual Q&A Generation**: Generate synthetic Q&A pairs across multiple languages using customizable prompts | ||
| - **NemotronCC Pipelines**: Advanced text transformation and knowledge extraction workflows: | ||
| - **Wikipedia Paraphrasing**: Improve low-quality text by rewriting in Wikipedia-style prose | ||
| - **Diverse QA**: Generate diverse question-answer pairs for reading comprehension training | ||
| - **Distill**: Create condensed, information-dense paraphrases preserving key concepts | ||
| - **Extract Knowledge**: Extract factual content as textbook-style passages | ||
| - **Knowledge List**: Extract structured fact lists from documents | ||
|
|
||
| - **New Docker container**: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base; obtainable through the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) (`nvcr.io/nvidia/nemo-curator:{{ container_version }}`) | ||
| - **Docker file to build own image**: Simplified [Dockerfile](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/Dockerfile) structure for custom container builds with FFmpeg support | ||
| - **UV source installations**: Integrated UV package manager (v0.8.22) for faster dependency management | ||
| - **PyPI improvements**: Enhanced PyPI installation with modular extras for targeted functionality: | ||
| Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md). | ||
|
|
||
| ```{list-table} Available Installation Extras | ||
| :header-rows: 1 | ||
| :widths: 25 35 40 | ||
| * - Extra | ||
| - Installation Command | ||
| - Description | ||
| * - **All Modalities** | ||
| - `nemo-curator[all]` | ||
| - Complete installation with all modalities and GPU support | ||
| * - **Text Curation** | ||
| - `nemo-curator[text_cuda12]` | ||
| - GPU-accelerated text processing with RAPIDS | ||
| * - **Image Curation** | ||
| - `nemo-curator[image_cuda12]` | ||
| - Image processing with NVIDIA DALI | ||
| * - **Audio Curation** | ||
| - `nemo-curator[audio_cuda12]` | ||
| - Speech recognition with NeMo ASR models | ||
| * - **Video Curation** | ||
| - `nemo-curator[video_cuda12]` | ||
| - Video processing with GPU acceleration | ||
| * - **Basic GPU** | ||
| - `nemo-curator[cuda12]` | ||
| - CUDA utilities without modality-specific dependencies | ||
| ``` | ||
|
|
||
| All GPU installations require the NVIDIA PyPI index: | ||
| ```bash | ||
| uv pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[EXTRA]" | |
| ``` | ||
|
|
||
| ## New Modalities | ||
|
|
||
| ### Video | ||
|
|
||
| NeMo Curator now supports comprehensive [video data curation](../../curate-video/index.md) with distributed processing capabilities: | ||
|
|
||
| - **Video splitting**: [Fixed-stride](../../curate-video/process-data/clipping.md) and [scene-change detection (TransNetV2)](../../curate-video/process-data/clipping.md) for clip extraction | ||
| - **Semantic deduplication**: [K-means clustering and pairwise similarity](../../curate-video/process-data/dedup.md) for near-duplicate clip removal | ||
| - **Content filtering**: [Motion-based filtering](../../curate-video/process-data/filtering.md) and [aesthetic filtering](../../curate-video/process-data/filtering.md) for quality improvement | ||
| - **Embedding generation**: InternVideo2 and Cosmos-Embed1 models for clip-level embeddings | ||
| - **Enhanced captioning**: [VL-based caption generation with optional LLM-based rewriting](../../curate-video/process-data/captions-preview.md) (Qwen-VL and Qwen-LM supported) for detailed video descriptions | ||
| - **Ray-based distributed architecture**: Scalable video processing with [autoscaling support](../concepts/video/architecture.md) | ||
|
|
||
| ### Audio | ||
|
|
||
| New [audio curation capabilities](../../curate-audio/index.md) for speech data processing: | ||
|
|
||
| - **ASR inference**: [Automatic speech recognition](../../curate-audio/process-data/asr-inference/index.md) using NeMo Framework pretrained models | ||
| - **Quality assessment**: [Word Error Rate (WER) and Character Error Rate (CER)](../../curate-audio/process-data/quality-assessment/index.md) calculation | ||
| - **Speech metrics**: [Duration analysis and speech rate metrics](../../curate-audio/process-data/audio-analysis/index.md) (words/characters per second) | ||
| - **Text integration**: Seamless integration with [text curation workflows](../../curate-audio/process-data/text-integration/index.md) via `AudioToDocumentStage` | ||
| - **Manifest support**: JSONL manifest format for audio file management | ||
|
|
||
| ## Modality Refactors | ||
|
|
||
| ### Text | ||
|
|
||
| - **Ray backend migration**: Complete transition from Dask to Ray for distributed [text processing](../../curate-text/index.md) | ||
| - **Improved model-based classifier throughput**: Better overlapping of compute between tokenization and inference through [length-based sequence sorting](../../curate-text/process-data/quality-assessment/distributed-classifier.md) for optimal GPU memory utilization | ||
| - **Task-centric architecture**: New `Task`-based processing model for finer-grained control | ||
| - **Pipeline redesign**: Updated `ProcessingStage` and `Pipeline` architecture with resource specification | ||
|
|
||
| ### Image | ||
|
|
||
| - **Pipeline-based architecture**: Transitioned from legacy `ImageTextPairDataset` to modern [stage-based processing](../../curate-images/index.md) with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages | ||
| - **DALI-based image loading**: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback | ||
| - **Modular processing stages**: Separate stages for [embedding generation](../../curate-images/process-data/embeddings/index.md), [aesthetic filtering](../../curate-images/process-data/filters/aesthetic.md), and [NSFW filtering](../../curate-images/process-data/filters/nsfw.md) | ||
| - **Task-based data flow**: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores | ||
|
|
||
| Learn more about [image curation](../../curate-images/index.md). | ||
|
|
||
| ## Deduplication Improvements | ||
|
|
||
| Enhanced deduplication capabilities across all modalities with improved performance and flexibility: | ||
|
|
||
| - **Exact and Fuzzy deduplication**: Updated [rapidsmpf-based shuffle backend](../../reference/infrastructure/gpu-processing.md) for more efficient GPU-to-GPU data transfer and better spilling capabilities | ||
| - **Semantic deduplication**: Support for deduplicating [text](../../curate-text/process-data/deduplication/semdedup.md) and [video](../../curate-video/process-data/dedup.md) datasets using unified embedding-based workflows | ||
| - **New ranking strategies**: Added `RankingStrategy` which allows you to rank elements within cluster centers to decide which point to prioritize during duplicate removal, supporting [metadata-based ranking](../../curate-text/process-data/deduplication/semdedup.md) to prioritize specific datasets or inputs | ||
|
|
||
| ## Core Refactors | ||
|
|
||
| The architecture refactor introduces a layered system with unified interfaces and multiple execution backends: | ||
|
|
||
| ```{mermaid} | ||
| graph LR | ||
| subgraph "User Layer" | ||
| P[Pipeline] | ||
| S1[ProcessingStage X→Y] | ||
| S2[ProcessingStage Y→Z] | ||
| S3[ProcessingStage Z→W] | ||
| R[Resources<br/>CPU/GPU/NVDEC/NVENC] | ||
| end | ||
| subgraph "Orchestration Layer" | ||
| BE[BaseExecutor Interface] | ||
| end | ||
| subgraph "Backend Layer" | ||
| XE[XennaExecutor<br/>Production Ready] | ||
| RAP[RayActorPoolExecutor<br/>Experimental] | ||
| RDE[RayDataExecutor<br/>Experimental] | ||
| end | ||
| subgraph "Adaptation Layer" | ||
| XA[Xenna Adapter] | ||
| RAPA[Ray Actor Adapter] | ||
| RDA[Ray Data Adapter] | ||
| end | ||
| subgraph "Execution Layer" | ||
| X[Cosmos-Xenna<br/>Streaming/Batch] | ||
| RAY1[Ray Actor Pool<br/>Load Balancing] | ||
| RAY2[Ray Data API<br/>Dataset Processing] | ||
| end | ||
| P --> S1 | ||
| P --> S2 | ||
| P --> S3 | ||
| S1 -.-> R | ||
| S2 -.-> R | ||
| S3 -.-> R | ||
| P --> BE | ||
| BE --> XE | ||
| BE --> RAP | ||
| BE --> RDE | ||
| XE --> XA | ||
| RAP --> RAPA | ||
| RDE --> RDA | ||
| XA --> X | ||
| RAPA --> RAY1 | ||
| RDA --> RAY2 | ||
| style XE fill:#90EE90 | ||
| style RAP fill:#FFE4B5 | ||
| style RDE fill:#FFE4B5 | ||
| style P fill:#E6F3FF | ||
| style BE fill:#F0F8FF | ||
| ``` | ||
|
|
||
| ### Pipelines | ||
|
|
||
| - **New Pipeline API**: Ray-based pipeline execution with `BaseExecutor` interface | ||
| - **Multiple backends**: Support for [Xenna, Ray Actor Pool, and Ray Data execution backends](../../reference/infrastructure/execution-backends.md) | ||
| - **Resource specification**: Configurable CPU and GPU memory requirements per stage | ||
| - **Stage composition**: Improved stage validation and execution orchestration | ||
|
|
||
| ### Stages | ||
|
|
||
| - **ProcessingStage redesign**: Generic `ProcessingStage[X, Y]` base class with type safety | ||
| - **Resource requirements**: Built-in resource specification for CPU and GPU memory | ||
| - **Backend adapters**: Stage adaptation layer for different Ray orchestration systems | ||
| - **Input/output validation**: Enhanced type checking and data validation | ||
|
|
||
| ## Tutorials | ||
|
|
||
| - **Text tutorials**: Updated all [text curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text) to use new Ray-based API | ||
| - **Image tutorials**: Migrated [image processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image) to unified backend | ||
| - **Audio tutorials**: New [audio curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/audio) | ||
| - **Video tutorials**: New [video processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video) | ||
|
|
||
| For all tutorial content, refer to the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials) in the NeMo Curator GitHub repository. | ||
|
|
||
| ## Known Limitations | ||
|
|
||
| > (Pending Refactor in Future Release) | ||
| ### Generation | ||
|
|
||
| - **Synthetic data generation**: Synthetic text generation features are being refactored for Ray compatibility | ||
| - **Hard negative mining**: Retrieval-based data generation workflows under development | ||
|
|
||
| ### PII | ||
|
|
||
| - **PII processing**: Personal Identifiable Information removal tools are being updated for Ray backend | ||
| - **Privacy workflows**: Enhanced privacy-preserving data curation capabilities in development | ||
|
|
||
| ### Blending & Shuffling | ||
|
|
||
| - **Data blending**: Multi-source dataset blending functionality being refactored | ||
| - **Dataset shuffling**: Large-scale data shuffling operations under development | ||
|
|
||
| ## Docs Refactor | ||
|
|
||
| - **Local preview capability**: Improved documentation build system with local preview support | ||
| - **Modality-specific guides**: Comprehensive documentation for each supported modality ([text](../../curate-text/index.md), [image](../../curate-images/index.md), [audio](../../curate-audio/index.md), [video](../../curate-video/index.md)) | ||
| - **API reference**: Complete [API documentation](../../apidocs/index.rst) with type annotations and examples | ||
|
|
||
| --- | ||
|
|
||
| ## What's Next | ||
|
|
||
| The next release will focus on completing the refactor of Synthetic Data Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support. | ||
| The next release will focus on ... |
The release notes have been drastically reduced from 231 lines to just 44 lines, removing all the comprehensive release information about the v26.02 release. The original content included detailed information about:
- Installation updates (Docker, UV, PyPI improvements)
- New modalities (Video and Audio support)
- Modality refactors (Text and Image improvements)
- Deduplication improvements
- Core architecture refactors (Pipeline and Stage redesigns)
- Tutorials updates
- Known limitations
This content should not be completely replaced. Instead, the SDG documentation should be added to the existing release notes as a new section, not used as a replacement for all the other release information.
The release notes serve as a critical reference for users understanding what changed in the release. Removing this information will confuse users and make it difficult for them to understand the full scope of the v26.02 release.
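An accidental removal like this could be caught by a simple check that the expected sections are still present. The sketch below is an assumption about how such a guard might look: the heading names are taken from the diff above, and `missing_sections` is a hypothetical helper, not part of the repository's tooling.

```python
import re

# Level-2 headings from the original release notes shown in the diff.
REQUIRED_SECTIONS = [
    "Installation Updates",
    "New Modalities",
    "Modality Refactors",
    "Deduplication Improvements",
    "Core Refactors",
    "Tutorials",
    "Known Limitations",
]


def missing_sections(markdown: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return required level-2 headings that do not appear in the text."""
    present = set(re.findall(r"^## +(.+?) *$", markdown, flags=re.MULTILINE))
    return [name for name in required if name not in present]


# A reduced file that keeps only the new SDG section and the closing section.
reduced = "## Synthetic Data Generation\n\nNew capabilities.\n\n## What's Next\n"
print(missing_sections(reduced))
```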
| | **[Multilingual Q&A Generation](synthetic_data_generation_example.py)** | Generate synthetic Q&A pairs in multiple languages using LLMs | `synthetic_data_generation_example.py` | | ||
| --- | ||
|
|
||
| ## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) |
Missing newline after the "Additional Resources" heading. The markdown formatting is incorrect - there should be a blank line between the heading and the list items.
| ## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) | |
| ## Additional Resources | |
| - [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) |
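The fused heading in this diff can also be detected mechanically. The sketch below is a narrow heuristic with a hypothetical helper name: it only matches a heading followed directly by `- [` with no space in between, as in the flagged line.

```python
import re

# A heading immediately followed by "- [" on the same line, as in the flagged diff.
FUSED = re.compile(r"^(#{1,6} .*?\S)(- \[.*)$")


def split_fused_heading(line: str) -> list[str]:
    """Split 'heading- [item]' into heading, blank line, and list item."""
    match = FUSED.match(line)
    if not match:
        return [line]
    return [match.group(1), "", match.group(2)]


line = "## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)"
print(split_fused_heading(line))
```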
1 file reviewed, 1 comment
| | **[Multilingual Q&A Generation](synthetic_data_generation_example.py)** | Generate synthetic Q&A pairs in multiple languages using LLMs | `synthetic_data_generation_example.py` | | ||
| --- | ||
|
|
||
| ## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) |
Missing newline between heading and bullet list. Add a blank line after "## Additional Resources" for proper markdown rendering.
| ## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) | |
| ## Additional Resources | |
| - [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) |
…urator into llane/sdg-ray-docs
Additional Comments (1)
- tutorials/synthetic/README.md, line 103: style: Missing trailing newline at end of file.
12 files reviewed, 3 comments
| ## Synthetic Data Generation | ||
|
|
||
| **Migrating from a previous version of NeMo Curator?** Refer to the {ref}`Migration Guide <migration-guide>` for step-by-step instructions and the {ref}`Migration FAQ <migration-faq>` for common questions. | ||
| New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs: | ||
|
|
||
| ## Installation Updates | ||
| - **LLM Client Infrastructure**: OpenAI-compatible async/sync clients with automatic rate limiting, retry logic, and exponential backoff | ||
| - **Multilingual Q&A Generation**: Generate synthetic Q&A pairs across multiple languages using customizable prompts | ||
| - **NemotronCC Pipelines**: Advanced text transformation and knowledge extraction workflows: | ||
| - **Wikipedia Paraphrasing**: Improve low-quality text by rewriting in Wikipedia-style prose | ||
| - **Diverse QA**: Generate diverse question-answer pairs for reading comprehension training | ||
| - **Distill**: Create condensed, information-dense paraphrases preserving key concepts | ||
| - **Extract Knowledge**: Extract factual content as textbook-style passages | ||
| - **Knowledge List**: Extract structured fact lists from documents | ||
|
|
||
| - **New Docker container**: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base; obtainable through the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) (`nvcr.io/nvidia/nemo-curator:{{ container_version }}`) | ||
| - **Docker file to build own image**: Simplified [Dockerfile](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/Dockerfile) structure for custom container builds with FFmpeg support | ||
| - **UV source installations**: Integrated UV package manager (v0.8.22) for faster dependency management | ||
| - **PyPI improvements**: Enhanced PyPI installation with modular extras for targeted functionality: | ||
| Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md). |
logic: The release notes have been reduced from 231 lines to 44 lines, removing all comprehensive v26.02 release information including Docker updates, PyPI improvements, video/audio modalities, deduplication improvements, and architecture refactors. The SDG documentation should be added to existing release notes, not replace them entirely. Users need the full scope of v26.02 changes for understanding what's new in the release.
| | **[Multilingual Q&A Generation](synthetic_data_generation_example.py)** | Generate synthetic Q&A pairs in multiple languages using LLMs | `synthetic_data_generation_example.py` | | ||
| --- | ||
|
|
||
| ## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) |
style: Missing blank line between heading and list.
| ## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) | |
| ## Additional Resources | |
| - [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) |
The doc doesn't explain how to generate the input data required by the CLI example below:
Process Parquet input files: `python nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py`
Initial pass at creating SDG docs.
Note: I'll be out next week, but feel free to leave any changes and I'll get to them ASAP.